Introduction

In this analysis report I would like to find some patterns or characteristics that make some players the best.
After analysing the data I will use data analysis and statistics techniques using the libraries Pandas, Numpy and Matplotlib to manage datasets, make vectorized calculations and create visualisations to help understand the results.

Questions

  1. Which school did best players go to?
  2. Did best players get highest salaries?
  3. Is there a correlation between best teams and managers?

In [1]:
# import libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%pylab inline

from bokeh.io import output_notebook, show
from bkcharts import Donut
output_notebook()


Populating the interactive namespace from numpy and matplotlib
Loading BokehJS ...

Intro to Data Analysis

Final Project: Investigate Data

In this project I will:

  1. Choose a dataset (titanic, baseball)
  2. Analyse the data
  3. Make questions based on the analysis
  4. Report found answers

In [2]:
# declare helper functions and load dataset

def load_files(data_path, filenames):
    """Load data files
    
    Loads every csv file from the data directory.
    
    Args:
        data_path(string): Path to files
        filenames(list): List of files to load from the data_path
        
    Returns:
        dict: A dictionary of table names as keys and DataFrame objects as values
    """
    dfs = {}
    for f in filenames:
        parts = f.split('.')
        if parts[1] != 'csv':
            continue
        dfname = parts[0]
        absfilename = data_path + '/' + f
        dfs[dfname] = pd.read_csv(absfilename)
    return dfs

def describe_df(df):
    """Describe a DataFrame
    
    Prints DataFrame information, the result of describe() method, and a blank line underneath.
    
    Args:
        df(DataFrame): The DataFrame to be described
    """
    print(df)
    display(dataframes[df].describe())
    print('')

def describe_dataframes(dataframes):
    """Describe a list of DataFrame objects
    
    Args:
        dataframes(list): A list of DataFrame objects
    """
    for df in dataframes:
        describe_df(df)
        
def display_1d_array(array, col_name='Data'):
    """Display 1D Numpy array nicely
    
    Args:
        array (np.array): 1D Numpy array
        col_name (string, optional): Name for column to display
    """
    display(pd.DataFrame(array, columns=[col_name], index=[i+1 for i in range(len(array))]))

In [3]:
# declare visualization functions

def distribution(data, figsize=(7, 5), color='c', xlabel='X', ylabel='Y', title=None):
    """Draw a Seaborn Distribution plot
    
    Draws a distribution plot setting proper size and title
    
    Args:
        data(DataFrame): Data to be used to draw
        color(string, optional): Single character that represents a colour setting. Defaults to c
        xlabel(string, optional): Label for X axis. Defaults to X
        ylabel(string, optional): Label for Y axis. Defaults to Y
        title(string, optional): Title to be displayed on top of the plot. Defaults to None
    
    .. _Seaborn Documentation:
        https://seaborn.pydata.org/generated/seaborn.distplot.html
    
    """
    f, axes = plt.subplots(1, 1, figsize=(7, 5), sharex=True)
    sns.despine(left=True)
    sns.distplot(data, color=color, kde_kws={"shade": True})
    axes.set(xlabel=xlabel, ylabel=ylabel)
    plt.setp(axes, yticks=[])
    plt.tight_layout()
    if title is not None:
        f.suptitle(title)


def plot_correlation(x, y, xlabel, ylabel, legend=None, title=None):
    """Draw a Seaborn Regplot plot
    
    Plot data and a linear regression model fit.
    
    Args:
        x(array): X data in Numpy array format
        y(array): Y data in Numpy array format
        xlabel(string): Label for X axis
        ylabel(string): Label for Y axis
        legend(string, optional): Title to be displayed inside the plot. Defaults to None
        title(string, optional): Title to be displayed on top of the plot. Defaults to None
        
    .. _Seaborn Documentation:
        https://seaborn.pydata.org/generated/seaborn.regplot.html
    """
    if title is not None:
        fig = plt.figure()
        fig.suptitle(title)
    ax = sns.regplot(x="x", y="y", data=pd.DataFrame({'x': x, 'y': y}), label=legend, x_jitter=.2)
    ax.set(xlabel=xlabel, ylabel=ylabel)
    if legend is not None:
        ax.legend(loc="best")


def plot_correlation2(data, x_var, y_var, title=None):
    """Draw a Seaborn Joinplot plot
    
    Draw a plot of two variables with bivariate and univariate graphs.
    
    Args:
        data(DataFrame): Data to draw the plot
        x_var(string): Name of variable for X axis
        y_var(string): Name of variable for Y axis
        title(string, optional): Title to be displayed on top of the plot. Defaults to None

    .. _Seaborn Documentation:
        http://seaborn.pydata.org/generated/seaborn.jointplot.html
    """
    sns.set(style="darkgrid", color_codes=True)
    g = sns.jointplot(x_var, y_var, data=data, kind="reg",
                      xlim=(0, data[x_var].max()), ylim=(0, data[y_var].max()), color="r", size=7)
    if title is not None:
        g.fig.suptitle(title)

In [4]:
# load data
data_path = 'data/baseballdatabank-2017.1/core'
filenames = os.listdir(data_path)
dataframes = load_files(data_path, filenames)

Analyse data

Main Table: Master


In [5]:
dataframes['Master'].head()


Out[5]:
playerID birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay ... nameLast nameGiven weight height bats throws debut finalGame retroID bbrefID
0 aardsda01 1981.0 12.0 27.0 USA CO Denver NaN NaN NaN ... Aardsma David Allan 215.0 75.0 R R 2004-04-06 2015-08-23 aardd001 aardsda01
1 aaronha01 1934.0 2.0 5.0 USA AL Mobile NaN NaN NaN ... Aaron Henry Louis 180.0 72.0 R R 1954-04-13 1976-10-03 aaroh101 aaronha01
2 aaronto01 1939.0 8.0 5.0 USA AL Mobile 1984.0 8.0 16.0 ... Aaron Tommie Lee 190.0 75.0 R R 1962-04-10 1971-09-26 aarot101 aaronto01
3 aasedo01 1954.0 9.0 8.0 USA CA Orange NaN NaN NaN ... Aase Donald William 190.0 75.0 R R 1977-07-26 1990-10-03 aased001 aasedo01
4 abadan01 1972.0 8.0 25.0 USA FL Palm Beach NaN NaN NaN ... Abad Fausto Andres 184.0 73.0 L L 2001-09-10 2006-04-13 abada001 abadan01

5 rows × 24 columns

Missing Values


In [6]:
rows_with_missing_values = len(dataframes['Master'][dataframes['Master'].isnull().any(axis=1)])
total = len(dataframes['Master'])

def bokeh_pie_chart(values, labels):
    """Pie chart
    
    Draws a Bokeh pie chart
    
    Args:
        values(list): List of Numpy arrays, one per type of data to show
        labels(list): List of labels matching amount of arrays in values argument
    """
    data = pd.Series(values, index=labels)
    pie_chart = Donut(data)
    show(pie_chart)

bokeh_pie_chart([rows_with_missing_values, total-rows_with_missing_values], ['Rows with missing values', 'Rows without missing values'])


As you can see more than 50% of entries, only from Master table (players), have missing values. I could not find any relevant missing values directly on tables that could directly affect my calculations. However I had to deal with missing relationships between players and schools tables as you will see underneath.

Report found answers

To answer any question about best players first I will have define what a best player is and how I will score it.

Best Player: Player who participated on most winning matches through time being part of one or more teams.

Player/Team Score: Ratio from the difference between Wins and Losses over amount of games played, assigned to every team.

Wins


In [7]:
# Wins calculation
team_wins_by_year = dataframes['Teams'].groupby(['teamID', 'yearID'], as_index=False)['teamID', 'W'].sum()
team_wins_by_year.sort_values('W', ascending=False).describe()
display(team_wins_by_year.describe())


yearID W
count 2835.000000 2835.000000
mean 1955.681834 74.814109
std 41.767356 17.591208
min 1871.000000 0.000000
25% 1920.000000 66.000000
50% 1964.000000 77.000000
75% 1993.000000 87.000000
max 2016.000000 116.000000

As you can see underneath, most of teams won between 50 to 100 games in the time window contained in this dataset. I took a 150 teams sample to show the tendency.


In [8]:
team_wins_by_year_sample = team_wins_by_year.sample(n=150)
fig = plt.figure(figsize=(16,10))
ax = sns.stripplot(x="teamID", y="W", data=team_wins_by_year_sample, size=10)
for item in ax.get_xticklabels():
    item.set_rotation(45)
ax.set(xlabel='Teams', ylabel='Wins')
fig.suptitle('Teams by Yearly Wins (based on sample of 150)', fontsize=16);



In [9]:
# Team Wins calculation
team_wins_total = team_wins_by_year.groupby('teamID', as_index=False)['W'].sum()
display(team_wins_total.describe())


W
count 149.000000
mean 1423.476510
std 2649.598666
min 0.000000
25% 43.000000
50% 118.000000
75% 1373.000000
max 10711.000000

Some statistical numbers about Team Wins

Losses


In [10]:
# Losses calculation
team_losses_by_year = dataframes['Teams'].groupby(['teamID', 'yearID'], as_index=False)['teamID', 'L'].sum()
team_losses_by_year.sort_values('L', ascending=False).describe()
display(team_losses_by_year.describe())
team_losses_by_year_sample = team_losses_by_year.sample(n=150)


yearID L
count 2835.000000 2835.000000
mean 1955.681834 74.814109
std 41.767356 17.331455
min 1871.000000 4.000000
25% 1920.000000 65.000000
50% 1964.000000 76.000000
75% 1993.000000 87.000000
max 2016.000000 134.000000

Now, showing below the amount of games that most of teams lost, the average is slightly higher (from 60 to 100).
Does it make sense? Well, not everyone can win...


In [11]:
fig = plt.figure(figsize=(16,10))
ax = sns.stripplot(x="teamID", y="L", data=team_losses_by_year_sample, size=8)
for item in ax.get_xticklabels():
    item.set_rotation(45)
ax.set(xlabel='Teams', ylabel='Losses')
fig.suptitle('Teams by Yearly Losses (based on sample of 150)', fontsize=16);



In [12]:
# Team losses calculation
team_losses_total = team_losses_by_year.groupby('teamID', as_index=False)['L'].sum()
display(team_losses_total.describe())


L
count 149.000000
mean 1423.476510
std 2570.942743
min 4.000000
25% 51.000000
50% 149.000000
75% 1575.000000
max 10741.000000

And some more statistical numbers about Team Losses

Defining Scores

Any score will be calculated as follows: (wins - losses) / n_games.


In [13]:
def calc_score(x):
    """Score calculation
    
    Calculates score based on Wins vs. Losses over amount of games
    
    Args:
        x(DataFrame): One column DataFrame
    Returns:
        float: Formula result
    """
    return (x['W'] - x['L']) / (x['W'] + x['L'])

Statistical information and first entries of Team Results


In [14]:
# Team results calculation
teams_results = team_wins_total.merge(team_losses_total, on='teamID', how='inner')
tscores = teams_results.apply(calc_score, axis=1)
teams_results['score'] = tscores.values
display(tscores.describe())
display(teams_results.head())


count    149.000000
mean      -0.116981
std        0.290331
min       -1.000000
25%       -0.260759
50%       -0.039216
75%        0.035993
max        0.663717
dtype: float64
teamID W L score
0 ALT 6 19 -0.520000
1 ANA 664 632 0.024691
2 ARI 1503 1575 -0.023392
3 ATL 4174 3944 0.028332
4 BAL 5130 4864 0.026616

Again, now displayed as a distribution, teams seem to be draw, but slightly negatively skewed. Which has the same meaning as the previous plot: there are more teams losing than winning.


In [15]:
distribution(tscores, xlabel='Score', ylabel='Teams', title='Team Score Distribution')


Best Teams

Since Teams are too many, I will take a sample using Pandas' quantile function set to 0.8 in order to get only the best 20% of best teams.


In [16]:
# Best teams calculation
best_teams = teams_results[teams_results['score'] > teams_results.quantile(q=0.8)['score']].sort_values('score', ascending=False)
display(best_teams.describe())


W L score
count 30.000000 30.000000 30.000000
mean 899.566667 693.766667 0.212257
std 2219.937413 1748.004334 0.142531
min 8.000000 4.000000 0.064220
25% 73.250000 47.250000 0.126745
50% 93.500000 61.000000 0.156710
75% 431.000000 335.000000 0.279741
max 10084.000000 7648.000000 0.663717

Now we have only 30 teams. If you check the previous figure the minimum score_norm was -1. Now it is 0.064, and max of 0.664.

Best Teams

After selecting only teams with higher scores the difference is quite significant. This plot is clearly positive skewed, showing that best scores are reserved only for the best of the best teams.


In [17]:
distribution(best_teams['score'].values, color='m', xlabel='Score', ylabel='Teams', title='Best Team Score Distribution')


Best Players

Score range goes now from -3 to 3.


In [18]:
# Player results calculation
players_results = dataframes['Appearances'].merge(teams_results, on='teamID', how='inner').groupby(['playerID'], as_index=False)['W', 'L', 'score'].sum()
pscores = players_results['score'].values
display(players_results.describe())


W L score
count 18912.000000 18912.000000 18912.000000
mean 33523.823551 32767.876428 -0.016261
std 36578.098625 35322.251635 0.321425
min 0.000000 4.000000 -3.027457
25% 9082.000000 8707.000000 -0.112395
50% 19248.500000 18836.000000 0.006084
75% 48576.250000 47457.000000 0.074130
max 245091.000000 258283.000000 3.117700

Now looking only at the best players, the distribution of score is perfectly symmetric.


In [19]:
distribution(pscores, xlabel='Score', ylabel='Players', title='Best Players Score Distribution')


Now looking at the best 2% of players the shape of this positively skewed distribution represents only a small portion on the right side of the previous plot.


In [20]:
# Players participating on games calculation
players_appearences = players_results.merge(dataframes['Appearances'].groupby('playerID', as_index=False)['teamID'].count(), on='playerID', how='left')
best_players_appearences = players_appearences[players_appearences['score'] > players_appearences.quantile(q=0.98)['score']]
distribution(best_players_appearences['score'].values, color='r', xlabel='Score', ylabel='Players', title='2% Best Players Distribution')


Question 1

Which school did winners go to?

Surprisingly, the best player regarding games played in winning teams, does not report school, as you can see below.


In [21]:
def set_to_str(x):
    """Set to string helper function
    
    Converts a set to string
    
    Args:
        x(set): Set to be represented as a string
    Returns:
        string: String representation of set
    """
    x = ','.join([str(c) for c in x])
    return x
def set_to_count(x):
    """Set to count helper function
    
    Converts a set into the count of its elements
    
    Args:
        x(set): Set to be counted
    Returns:
        int: Number of elements contained in the set
    """
    if str(x) == '{nan}':
        return 0
    return len(x)
schools_of_winners = best_players_appearences.merge(dataframes['CollegePlaying'], on='playerID', how='left')
best_players_appearences_with_schools = best_players_appearences.merge(schools_of_winners, on='playerID', how='left').groupby('playerID', as_index=False).agg({'schoolID': lambda x: set(x)})
best_players_appearences_with_schools['school_count'] = best_players_appearences_with_schools['schoolID'].apply(set_to_count)
display(best_players_appearences_with_schools.merge(dataframes['Master'], on='playerID', how='left')[['playerID', 'nameGiven', 'schoolID', 'school_count']].head())


playerID nameGiven schoolID school_count
0 aaronha01 Henry Louis {nan} 0
1 adcocjo01 Joseph Wilbur {lsu} 1
2 alexado01 Doyle Lafayette {nan} 0
3 allenjo02 John Thomas {nan} 0
4 aloufe01 Felipe Rojas {nan} 0

Almost none of best players report schools

Missing Relationships

I will have to exclude player entries with no schools in order to generate a ranking of best schools, based on best players who have reported one or more schools. To do so, I will count schools for each player, and then exclude those entries with school equal to zero.

Top10 Schools regarding best players


In [22]:
best_players_with_school_and_score = best_players_appearences.merge(best_players_appearences_with_schools, on='playerID', how='left')
top10_schools = best_players_with_school_and_score[best_players_with_school_and_score['school_count'] > 0].sort_values('score', ascending=False).apply({'schoolID': lambda x: ','.join(x)}).iloc[0:10]
display_1d_array(top10_schools.values, col_name='School')


School
1 columbia
2 alcalho
3 pennst
4 txsjjcn
5 ulala
6 bucknell
7 stbonny
8 fordham
9 casjcco
10 sliprock

List ot top 10 schools

Question 2

Did best players get the highest salaries?

Apparently participating in more winning games doesn't have a strict correlation with highest salaries, but they are still in a massive salaries list. To calculate the salaries, I took the maximum salary for each player/team.

As you can see underneath, highest salaries didn't go to players of teams with more successful games.


In [23]:
def calculate_mean_salary(x):
    count = len(x)
    return ((1/count) * np.sum(x)/1000000)

best_players_salaries = dataframes['Salaries'].merge(best_players_with_school_and_score, on='playerID', how='right')[['playerID', 'salary', 'yearID', 'score']].dropna().sort_values(['salary'], ascending=False)
bpsalaries = best_players_salaries['salary'].values
#bscore_norm = MinMaxScaler().fit_transform(bpsalaries.astype(np.float64).reshape(-1, 1)).flatten()
#best_players_salaries['salary_norm'] = bpscore_norm

highest_salaries_players_with_score = best_players_salaries.groupby('playerID', as_index=False).agg({'salary': calculate_mean_salary}).merge(players_results, on='playerID', how='left')
display(highest_salaries_players_with_score[['salary', 'score']].describe())


salary score
count 81.000000 81.000000
mean 3.393315 1.100684
std 3.968022 0.428036
min 0.139167 0.720521
25% 0.774000 0.824273
50% 1.620000 0.978056
75% 5.037919 1.188719
max 18.109830 2.747575

Statistical information of best players scores vs. salaries (expressed in millions of USD)


In [24]:
max_salary_idx = highest_salaries_players_with_score[['salary']].idxmax()
max_score_idx = highest_salaries_players_with_score[['score']].idxmax()
print('Highest salary (millions)')
display(highest_salaries_players_with_score.loc[max_salary_idx].merge(dataframes['Master'], on='playerID', how='left')[['playerID', 'nameGiven', 'salary', 'score']])
print('Highest score')
display(highest_salaries_players_with_score.loc[max_score_idx].merge(dataframes['Master'], on='playerID', how='left')[['playerID', 'nameGiven', 'salary', 'score']])


Highest salary (millions)
playerID nameGiven salary score
0 rodrial01 Alexander Enmanuel 18.10983 1.188719
Highest score
playerID nameGiven salary score
0 jeterde01 Derek Sanderson 13.927268 2.747575

Here is the difference between the player getting the highest salary and the one participating the more winning games.


In [25]:
plot_correlation2(data=highest_salaries_players_with_score, x_var="score", y_var="salary", title='Correlation between Salary and Score')


Do best players get higher salaries? It is pretty difficult to see using only the variables on this report. It might be interesting to include more variables like manager biases or getting a better normalization of salaries (money does not have the same value now and 100 years ago).

Question 3

Is there a correlation between winning teams and managers?

The plot displaying the distribution of manager scores seems to be perfectly symmetric. Which simply means that there are managers for all teams, from worst to best.


In [26]:
managers = dataframes['Managers']
mscores = managers.apply(calc_score, axis=1)
managers['score'] = mscores
managers = managers.groupby('playerID', as_index=False).sum()[['playerID', 'score', 'rank']]
best_managers = managers.sort_values('score', ascending=False)
distribution(mscores, color='g', xlabel='Score', ylabel='Managers', title='Manager Score Distribution')


It seems that best teams were managed by the best managers (or managers are considered the best because they won more games). This shows me how important leadership is to succeed.


In [27]:
manager_rank_score_corr = best_managers.sort_values('score', ascending=False).iloc[0:100][['playerID', 'score', 'rank']]
plot_correlation(x=manager_rank_score_corr['rank'], y=manager_rank_score_corr['score'], xlabel='Rank', ylabel='Score', title='Correlation between Manager Score and Ranking')


Limitations

I think there are many variables that I could include to see whether data could tell us anything more interesting, perhaps using Pitching, Fielding and Batting tables. Even curious patterns like birth or death places, or even doing the opposite job, like finding patterns correlating with losing teams.

The following and last plot shows where players die more often. Please don't play baseball in Philadelphia!!!


In [28]:
player_death_city = dataframes['Master'][['playerID', 'deathCity']].dropna().merge(players_results, on='playerID', how='left')
worst_players = player_death_city.sort_values('score')[['playerID', 'deathCity', 'score']].dropna()
top10_worst = worst_players[worst_players['score'] < worst_players.quantile(q=0.1)['score']].sort_values('score').groupby('deathCity', as_index=False)['playerID'].count().sort_values('playerID', ascending=False).iloc[0:10]
#display(top10_worst.iloc[0:10])
f, ax = plt.subplots(figsize=(10, 7))
sns.barplot(x="deathCity", y="playerID", data=top10_worst)
#sns.stripplot(x="deathCity", y="playerID", data=top10_worst)
ax.set(xlabel='Death City', ylabel='Amount of Dead Players')
f.suptitle('Just for fun: Most deadly cities to play baseball');


Conclusion

I've found that the best-reported player was Henry Louis (not reporting school).
To answer the questions I had at the beginning, I've found that Columbia is the school with more best players. However, there were many missing relationships between players and the schools they went. This means that this school is only the best after ignoring best players who didn't report schools.

About highest salaries, Alexander Emmanuel seems to be the luckiest one with more than USD 18M. On the other hand, the one with the best-reported performance was Derek Sanderson.

Now, the most important fact, in my opinion, is the last question: Is there a correlation between best players and managers? Well, there is. I've found that leadership seems to be strongly involved in success: Players and teams were better with best-ranked managers.

Dataset Reference

Sean Lahman, Award Winning Database Journalist and Author.

Website: http://www.seanlahman.com/baseball-archive/statistics/